Doubly Robust Policy Evaluation and Optimization
Authors
Abstract
We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of observed contexts and the action chosen by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is evaluation of a new policy gi...
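The abstract above concerns estimating the value of a new policy from logged contextual-bandit data. As a rough illustration of how a doubly robust estimate of that value can be formed, here is a minimal Python sketch; the data layout, the reward_model and target_policy callables, and all other names are illustrative assumptions, not taken from the paper.

import numpy as np

def dr_value_estimate(contexts, actions, propensities, rewards, reward_model, target_policy):
    """Doubly robust estimate of a target policy's value from logged bandit data.

    contexts      : observed contexts, indexable by i
    actions       : actions chosen by the logging policy, shape (n,)
    propensities  : logging probabilities mu(a_i | x_i), shape (n,)
    rewards       : observed rewards for the logged actions, shape (n,)
    reward_model  : callable (context, action) -> estimated mean reward
    target_policy : callable (context) -> action chosen by the policy being evaluated
    """
    n = len(rewards)
    estimates = np.empty(n)
    for i in range(n):
        x, a, p, r = contexts[i], actions[i], propensities[i], rewards[i]
        pi_a = target_policy(x)
        # Direct-method term: model prediction for the action the target policy would take.
        dm = reward_model(x, pi_a)
        # Importance-weighted correction, nonzero only when the logged action matches.
        correction = (r - reward_model(x, a)) / p if a == pi_a else 0.0
        estimates[i] = dm + correction
    # Unbiased if either the reward model or the propensities are correct (double robustness).
    return estimates.mean()

The correction term removes the reward model's bias wherever the logging policy happens to agree with the target policy, which is what gives the estimate its doubly robust character.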
Similar resources
More Robust Doubly Robust Off-policy Evaluation
We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from the data generated by another policy(ies). In particular, we focus on the doubly robust (DR) estimators that consist of an importance sampling (IS) component and a performance model, and utilize the low (or zero) bias of IS and low variance of the mo...
Doubly Robust Policy Evaluation and Learning
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions an...
Doubly Robust Off-policy Evaluation for Reinforcement Learning
We study the problem of evaluating a policy that is different from the one that generates data. Such a problem, known as off-policy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actually deploying it in the real system, which is a critical step of applying RL in most real-world applications....
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
Proof. For the base case $t = H + 1$, since $V_{\mathrm{DR}}^{0} = V(s_{H+1}) = 0$, it is obvious that at the $(H+1)$-th step the estimator is unbiased with 0 variance, and the theorem holds. For the inductive step, suppose the theorem holds for step $t + 1$. At time step $t$, we have: $\mathbb{V}_t\bigl[V_{\mathrm{DR}}^{H+1-t}\bigr] = \mathbb{E}_t\bigl[(V_{\mathrm{DR}}^{H+1-t})^2\bigr] - \mathbb{E}_t\bigl[V(s_t)\bigr]^2 = \mathbb{E}_t\Bigl[\bigl(V(s_t) + \rho_t\bigl(r_t + \gamma V_{\mathrm{DR}}^{H-t} - Q(s_t, a_t)\bigr)\bigr)^2 - V(s_t)^2\Bigr] + \mathbb{V}_t\bigl[V(s_t)\bigr] = \mathbb{E}\ldots
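The recursion analyzed in this excerpt is the step-wise doubly robust estimator for an H-step episode. In notation matching the excerpt, where $V$ and $Q$ are the value and action-value estimates supplied by the performance model and $\rho_t = \pi(a_t \mid s_t)/\mu(a_t \mid s_t)$ is the per-step importance weight, it is commonly defined as:

\[
V_{\mathrm{DR}}^{0} := 0, \qquad
V_{\mathrm{DR}}^{H+1-t} := V(s_t) + \rho_t\bigl(r_t + \gamma\,V_{\mathrm{DR}}^{H-t} - Q(s_t, a_t)\bigr), \qquad t = H, H-1, \ldots, 1,
\]

with $V_{\mathrm{DR}}^{H}$ serving as the estimate of the target policy's value for the episode. As a minimal sketch of this backward recursion for one logged trajectory (all names are hypothetical, not taken from the papers above):

def trajectory_dr_estimate(transitions, v_hat, q_hat, gamma=1.0):
    """Step-wise doubly robust estimate of a target policy's value from one episode.

    transitions : list of (state, action, reward, rho) tuples, where rho is the
                  per-step importance weight pi(a|s) / mu(a|s)
    v_hat       : callable state -> model estimate of V(s) under the target policy
    q_hat       : callable (state, action) -> model estimate of Q(s, a)
    gamma       : discount factor
    """
    v_dr = 0.0  # base case: value after the final step is zero
    # Backward recursion: V_DR <- V(s_t) + rho_t * (r_t + gamma * V_DR - Q(s_t, a_t))
    for state, action, reward, rho in reversed(transitions):
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr

With correct importance weights the estimate is unbiased regardless of the model, and with a correct model it is exact in expectation regardless of the weights; this is the double robustness the titles above refer to.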
Journal
Journal title: Statistical Science
Year: 2014
ISSN: 0883-4237
DOI: 10.1214/14-sts500